Search | WHO COVID-19 Research Database

GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. (preprint)

Max T. Zvyagin; Alexander Brace; Kyle Hippe; Yuntian Deng; Bin Zhang; Cindy Orozco Bohorquez; Austin Clyde; Bharat Kale; Danilo Perez-Rivera; Heng Ma; Carla M. Mann; Michael Irvin; J. Gregory Pauloski; Logan Ward; Valerie Hayot; Murali Emani; Sam Foreman; Zhen Xie; Diangen Lin; Maulik Shukla; Weili Nie; Josh Romero; Christian Dallago; Arash Vahdat; Chaowei Xiao; Thomas Gibbs; Ian Foster; James J. Davis; Michael E. Papka; Thomas Brettin; Anima Anandkumar; Venkatram Vishwanath; Arvind Ramanathan.

biorxiv; 2022.

Preprint in English | bioRxiv | ID: ppzbmed-10.1101.2022.10.10.511571

ABSTRACT

Our work seeks to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences, and then finetuning a SARS-CoV-2 specific model on 1.5 million genomes, we show that GenSLM can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLM represents one of the first whole-genome-scale foundation models which can generalize to other prediction tasks. We demonstrate the scaling of GenSLMs on both GPU-based supercomputers and AI-hardware accelerators, achieving over 1.54 zettaflops in training runs. We present initial scientific insights gleaned from examining GenSLMs in tracking the evolutionary dynamics of SARS-CoV-2, noting that their full potential on large biological data is yet to be realized.
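
The abstract does not say how a genome is turned into input for a language model. As a minimal sketch only, assuming non-overlapping three-nucleotide (codon) tokens over the alphabet A, C, G, T, the preprocessing step could look like the following. The names `codon_tokenize`, `encode`, `VOCAB`, and `UNK_ID` are hypothetical and are not taken from the GenSLM code.

```python
from itertools import product

# Hypothetical vocabulary: the 64 codons over A, C, G, T, with ids 1..64.
# Id 0 is reserved for any token containing another character (e.g. N).
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
VOCAB = {codon: i for i, codon in enumerate(CODONS, start=1)}
UNK_ID = 0


def codon_tokenize(sequence: str) -> list[str]:
    """Split a nucleotide string into non-overlapping 3-mers.

    Trailing nucleotides that do not fill a complete codon are dropped.
    """
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def encode(sequence: str) -> list[int]:
    """Map each codon token to its integer id, or UNK_ID if unrecognized."""
    return [VOCAB.get(token, UNK_ID) for token in codon_tokenize(sequence)]
```

Under these assumptions, `codon_tokenize("ATGGCCTA")` yields the two tokens `ATG` and `GCC`, and the incomplete trailing `TA` is discarded. A real pipeline would also need special tokens (padding, sequence boundaries) and a policy for ambiguous bases.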

Systematic modeling of SARS-CoV-2 protein structures (preprint)

Andrea Schafferhans; Neblina Sikta; Christian Stolte; Sandeep Kaur; Bosco Ho; Stuart Anderson; James B Procter; Christian Dallago; Nicola Bordin; Burkhard Rost; Matt Adcock.

biorxiv; 2020.

Preprint in English | bioRxiv | ID: ppzbmed-10.1101.2020.07.16.207308

ABSTRACT

In response to the COVID-19 pandemic caused by the SARS-CoV-2 virus, structural biologists are using experimental structural determination methods to better understand the viral proteome. Our goal in this work was to help researchers use these rapidly emerging structural data to gain detailed insights into the molecular mechanisms underlying COVID-19 infection. Our analysis was based on the protein sequences defined by UniProt as comprising the viral proteome. We systematically compared each SARS-CoV-2 protein sequence against all available protein 3D structures derived from any organism (164,250 PDB entries), using pairs of hidden Markov models built with the HHblits tool. We found 872 sequence-to-structure alignments assessed to have significant similarity (E < 10^-10) to infer structural similarity. The resulting 872 3D template models now provide a wealth of new detail, currently not available from related resources. To help make this large, complex dataset accessible and usable for other researchers, we also developed a tailored layout strategy to visually organise the 3D models by mapping them to the viral genome. The resulting graph provides an immediate and comprehensive visual overview of what is known - and not known - about the 3D structure of the viral proteome, thereby helping direct future research. The graph also clearly reveals all available structural evidence of viral mimicry or hijacking of human proteins, as well as all evidence of interactions between viral proteins. We have created PDF and online versions of the graph, in which users can click on any node in the graph to open the corresponding 3D model in the Aquaria molecular graphics system. In Aquaria, these models can then be colored to show sequence features, such as single nucleotide polymorphisms and posttranslational modifications.
Previous versions of Aquaria showed only features from UniProt; however, as part of this study, we have now added features from PredictProtein and CATH, thus providing a total of 32,717 features for SARS-CoV-2 protein sequences. In this work, we present novel insights found, using the above approach, into how SARS-CoV-2 mimics and hijacks host proteins, and how viral proteins self-assemble during infection. The resulting Aquaria-COVID resource is freely available online at https://aquaria.ws/covid19, and an accompanying video (https://youtu.be/J2nWQTlJNaY) explains how researchers can use the resource.
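
The significance filter described in the abstract can be illustrated with a short sketch. It assumes the cutoff is read as 10^-10 and that each alignment has already been parsed into a record holding its query protein, matched PDB entry, and E-value. The `Alignment` class and both helper functions are hypothetical illustrations, not part of HHblits or Aquaria.

```python
from dataclasses import dataclass

# Assumption: the abstract's cutoff interpreted as 10^-10.
E_VALUE_CUTOFF = 1e-10


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one sequence-to-structure alignment."""
    query: str      # viral protein identifier
    pdb_id: str     # matched structure identifier
    e_value: float  # expectation value reported by the search tool


def significant(alignments, cutoff=E_VALUE_CUTOFF):
    """Keep alignments whose E-value is strictly below the cutoff."""
    return [a for a in alignments if a.e_value < cutoff]


def group_by_query(alignments):
    """Collect alignments per viral protein, preserving input order."""
    grouped = {}
    for a in alignments:
        grouped.setdefault(a.query, []).append(a)
    return grouped
```

Given a list of parsed records, `group_by_query(significant(records))` would return, for each viral protein, the structures that pass the cutoff. Proteins with no entry in the result correspond to the "not known" regions that the abstract's graph is meant to expose.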

Subject(s)

COVID-19
